Optimize Vector128<long> multiplication for arm64 #104177

EgorBo · 2024-06-28T19:53:41Z

Follow-up to #103555 for arm64.

EgorBo · 2024-06-28T19:53:56Z

@EgorBot -arm64 -profiler

using System.IO.Hashing;
using BenchmarkDotNet.Attributes;

public class Bench
{
    static readonly byte[] Data = new byte[1000000];

    [Benchmark]
    public byte[] BenchXxHash128()
    {
        XxHash128 hash = new();
        hash.Append(Data);
        return hash.GetHashAndReset();
    }
}

dotnet-policy-service · 2024-06-28T19:54:06Z

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

src/coreclr/jit/hwintrinsicarm64.cpp

tannergooding · 2024-06-28T20:03:54Z

src/coreclr/jit/gentree.cpp

+                case TYP_LONG:
+                case TYP_ULONG:
+                {
+                    assert(simdSize == 16);
+
+                    // Make op1 and op2 multi-use:
+                    GenTree* op1Dup = fgMakeMultiUse(&op1);
+                    GenTree* op2Dup = fgMakeMultiUse(&op2);
+
+                    // long left0 = op1.GetElement(0)
+                    // long left1 = op1.GetElement(1)
+                    GenTree* left0 = gtNewSimdGetElementNode(TYP_LONG, op1, gtNewIconNode(0), simdBaseJitType, 16);
+                    GenTree* left1 = gtNewSimdGetElementNode(TYP_LONG, op1Dup, gtNewIconNode(1), simdBaseJitType, 16);
+
+                    // long right0 = op2.GetElement(0)
+                    // long right1 = op2.GetElement(1)
+                    GenTree* right0 = gtNewSimdGetElementNode(TYP_LONG, op2, gtNewIconNode(0), simdBaseJitType, 16);
+                    GenTree* right1 = gtNewSimdGetElementNode(TYP_LONG, op2Dup, gtNewIconNode(1), simdBaseJitType, 16);
+
+                    // Vector128<long> vec = Vector128.Create(left0 * right0, left1 * right1)
+                    op1          = gtNewOperNode(GT_MUL, TYP_LONG, left0, right0);
+                    op2          = gtNewOperNode(GT_MUL, TYP_LONG, left1, right1);
+                    GenTree* vec = gtNewSimdCreateScalarUnsafeNode(TYP_SIMD16, op1, simdBaseJitType, 16);
+                    return gtNewSimdHWIntrinsicNode(TYP_SIMD16, vec, gtNewIconNode(1), op2, NI_AdvSimd_Insert,
+                                                    simdBaseJitType, 16);
+                }


Is this just avoiding the cost of inlining, unrolling, and simplifying the work the JIT would have to do?
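
For readers unfamiliar with the JIT tree helpers, the expansion quoted above can be sketched in portable C++. This is only an illustration, not the JIT's code: `Vec2x64` and `MultiplyLanes` are hypothetical names, and unsigned lanes are assumed so that wraparound is well defined (the low 64 bits of the product are the same for signed and unsigned inputs).

```cpp
#include <array>
#include <cstdint>

// Hypothetical stand-in for Vector128<long>: two 64-bit lanes.
using Vec2x64 = std::array<std::uint64_t, 2>;

// Lane-wise multiply built from scalar steps, mirroring the quoted expansion:
// read both lanes of each operand, do two scalar multiplies, rebuild the vector.
Vec2x64 MultiplyLanes(const Vec2x64& left, const Vec2x64& right)
{
    // left0 = op1.GetElement(0), left1 = op1.GetElement(1)
    const std::uint64_t left0 = left[0];
    const std::uint64_t left1 = left[1];

    // right0 = op2.GetElement(0), right1 = op2.GetElement(1)
    const std::uint64_t right0 = right[0];
    const std::uint64_t right1 = right[1];

    Vec2x64 vec{};
    vec[0] = left0 * right0; // lane 0: the "create scalar" step
    vec[1] = left1 * right1; // lane 1: the "insert at index 1" step
    return vec;
}
```

Each lane wraps modulo 2^64, so `MultiplyLanes({3, 5}, {7, 11})` yields lanes 21 and 55.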

EgorBot · 2024-06-28T20:14:56Z

Benchmark results on Arm64

BenchmarkDotNet v0.13.12, Ubuntu 22.04.4 LTS (Jammy Jellyfish)
Unknown processor
  Job-OMCIXQ : .NET 9.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
  Job-KTSNVH : .NET 9.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD

Method	Toolchain	Mean	Error	Ratio
BenchXxHash128	Main	116.9 μs	0.05 μs	1.00
BenchXxHash128	PR	109.8 μs	0.04 μs	0.94

BDN_Artifacts.zip

Flame graphs: Main vs PR 🔥
Hot asm: Main vs PR
Hot functions: Main vs PR

For clean perf results, make sure you have just one [Benchmark] in your app.

neon-sunset · 2024-07-01T00:50:27Z

I wanted to ask: is there a reason LLVM's codegen variant did not work? On some cores, UMOV/SMOV have pretty bad latency compared with code that avoids a round-trip to scalar registers.

tannergooding · 2024-07-01T16:39:16Z

src/coreclr/jit/gentree.cpp

+                    return gtNewSimdHWIntrinsicNode(type, vec, gtNewIconNode(1), op2, NI_AdvSimd_Insert,
+                                                    simdBaseJitType, 16);


nit: Use gtNewSimdWithElementNode(type, vec, gtNewIconNode(1), op2, simdBaseJitType, simdSize), which ensures all the optimal handling takes place.

Thanks! Applied

tannergooding · 2024-07-02T18:12:50Z

src/coreclr/jit/gentree.cpp

+                        op1            = gtNewBitCastNode(TYP_LONG, op1);
+                        op2            = gtNewBitCastNode(TYP_LONG, op2);


Why bitcast instead of ToScalar? If this is generating better code, it seems like a pretty "core" scenario we're not handling from the ToScalar path.

@tannergooding Because op2 can be either an 8-byte vector (TYP_SIMD8) or an 8-byte scalar (TYP_LONG), so a bitcast let me simplify the handling. In my initial version I forgot that this path is used for both MUL(vector, vector) and MUL(vector, scalar), where the scalar is broadcast.

Ah, that makes sense, 👍
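
The point about the two operand shapes can be illustrated with a small C++ sketch. It is an analogy under stated assumptions, not the JIT's implementation: `Vec2x64`, `Vec1x64`, `BitCastToU64`, and `MultiplyByScalar` are hypothetical names, and the "bit cast" is modeled with `std::memcpy`, which reinterprets 8 bytes as a 64-bit integer whether they came from a single-lane vector or from a scalar.

```cpp
#include <array>
#include <cstdint>
#include <cstring>

// Hypothetical stand-ins: a two-lane 128-bit vector and an 8-byte single-lane vector.
using Vec2x64 = std::array<std::uint64_t, 2>;
using Vec1x64 = std::array<unsigned char, 8>;

// Reinterpret any 8-byte value as a 64-bit integer without changing its bits.
// The same call works for an 8-byte vector and for a 64-bit scalar.
template <typename T>
std::uint64_t BitCastToU64(const T& value)
{
    static_assert(sizeof(T) == sizeof(std::uint64_t), "bit cast needs an 8-byte source");
    std::uint64_t result;
    std::memcpy(&result, &value, sizeof(result));
    return result;
}

// MUL(vector, scalar): every lane is multiplied by the same 64-bit value,
// which is equivalent to broadcasting the scalar and multiplying lane-wise.
Vec2x64 MultiplyByScalar(const Vec2x64& left, std::uint64_t scalar)
{
    return {left[0] * scalar, left[1] * scalar};
}
```

Because the cast accepts either source, one code path can serve both the vector-by-vector and the vector-by-scalar case.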

Optimize Vector128<long> multiplication for arm64

ef0e46f

dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jun 28, 2024

dotnet-policy-service bot assigned EgorBo Jun 28, 2024

tannergooding reviewed Jun 28, 2024

tannergooding reviewed Jun 28, 2024

This was referenced Jun 28, 2024

System.IO.Net5Compat.Tests and System.IO.Tests suddenly exiting with error 137 #100558

Open

SIGKILL (OOM?) while running LibraryImportGenerator.Tests w/o actionable log messages or artifacts dotnet/dnceng#2496

Open

EgorBo added 5 commits June 30, 2024 22:38

Merge branch 'main' of https://github.com/dotnet/runtime into mul-lon…

717f62a

…g-arm64

add Vector64

77977f5

remove assert

e0a2942

add a comment

86d4fb3

clean up

818b8bd

build-analysis bot mentioned this pull request Jul 1, 2024

Test failure: GC\\Scenarios\\FinalizeTimeout\\FinalizeTimeout\\FinalizeTimeout.cmd #103874

Closed

tannergooding reviewed Jul 1, 2024

tannergooding approved these changes Jul 1, 2024

EgorBo added 2 commits July 2, 2024 15:59

Merge branch 'main' of https://github.com/dotnet/runtime into mul-lon…

f7a53d3

…g-arm64

handle scalarOp

b2206c9

EgorBo marked this pull request as ready for review July 2, 2024 18:03

EgorBo merged commit 6e039a8 into dotnet:main Jul 2, 2024
104 of 107 checks passed

EgorBo deleted the mul-long-arm64 branch July 2, 2024 18:03

tannergooding reviewed Jul 2, 2024

github-actions bot locked and limited conversation to collaborators Aug 2, 2024
